Introduction

I was given a dataset to work on as part of a coding challenge for SAP.iO recruitment! It’s a list of ~6000 wines, both white and red. I had to split it into two separate files because R is not the best with memory allocation, and my computer simply could not handle the CSV file with all 6000 wines plus their features. In the future, this problem could be alleviated by moving the work to an AWS instance and using cloud machines to run these models.

Preface - I did this mainly with the framework that I learned in my Intro to ML class (Industrial Engineering 142). They taught us how to code in R and to go through the steps of pre-processing, running regressions, building the correlation plot, modeling itself, and then interpreting the confusion matrix.

I’m going to work with two models primarily for this:

- The traditional k-nearest neighbors (KNN)
- randomForest

Pre-processing all the data.

The first thing I did was go into the CSV file itself and split the red and white wines into two separate datasets manually, though this can also be done in code by splitting into two different data frames.
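If we wanted to do the split in code instead, a minimal base-R sketch might look like this (assuming a hypothetical combined wines.csv with a type column that marks each wine as "red" or "white" - that file and column name are not part of the actual dataset):

```r
# Hypothetical combined file with a 'type' column ("red"/"white").
wines <- read.csv("wines.csv")

# split() partitions the rows by the value of the type column,
# giving one data frame per wine colour.
by_type  <- split(wines, wines$type)
red_df   <- by_type$red
white_df <- by_type$white
```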

We will start with the red wine dataset.

# The below lines are to set up R so it uses all of my 
# computer's cores in order to run the models much quicker.
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
registerDoParallel(cores = detectCores() - 1)

# Setting the seed makes the random sampling below reproducible
set.seed(10)

# Loading all the required libraries for my analysis
library(e1071)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(kknn)
## 
## Attaching package: 'kknn'
## The following object is masked from 'package:caret':
## 
##     contr.dummy
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(corrplot)
## corrplot 0.84 loaded
library(kernlab)
## 
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
## 
##     alpha
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:randomForest':
## 
##     combine
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Using the read.csv function to read the data
df <- read.csv("red.csv")

# We don't want any empty cells in the data, so we will
# change all of the NA values to 0.
df[is.na(df)] <- 0
str(df)
## 'data.frame':    1599 obs. of  14 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ astringency.rating  : num  0.81 0.86 0.85 1.14 0.81 0.8 0.85 0.79 0.83 0.8 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 0 0 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ vintage             : num  2001 2003 2006 2003 2004 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

Running str(df) displays the internal structure of the red wine dataset. It shows that there are 1599 samples and 14 different variables. Every predictor is of datatype num (numeric), while our response variable quality is an integer.

We are now going to visualize the data with a plot of quality against each of the predictor variables.

# Plot quality against each of the 13 predictors, overlaying a
# fitted regression line on each scatterplot.
for (i in 1:13) {
    plot(df[, i], jitter(df[, "quality"]), xlab = names(df)[i],
         ylab = "quality", cex = 0.5, cex.lab = 1)
  
    abline(lm(df[, "quality"] ~ df[, i]), lty = 3, lwd = 3)
}

The line on each of these plots displays the linear regression of our response variable quality as a function of each of the predictor variables.

We can see that a few of the regression lines show a very weak association to our response variable. We’ll later split into training and test sets and then we can figure out if we want to keep those features or remove them. I created a correlation plot next to further look at the associations between all the variables.

cor_redwines <- cor(df)
# Had some trouble displaying the graph, so going to save as .png and 
# then show in the R markdown file.
png(height = 1200, width = 1500, pointsize = 25, file = 'red_cor_plot.png')
corrplot(cor_redwines, method = 'number')
dev.off()  # close the png device so the file is actually written

Here’s our graph. You can see the weak relationships between quality and citric.acid, free.sulfur.dioxide, and sulphates, as shown in the plot. Having processed the data, we can conclude that non-linear classification models will be more appropriate than regression, because of all the weak associations shown in the correlation plot.
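To read those weak associations off numerically rather than from the plot, one option (a base-R sketch using the cor_redwines matrix computed above, on the numeric data frame before quality is converted to a factor) is to sort each predictor's correlation with quality:

```r
# Absolute correlation of each predictor with quality,
# ordered from strongest to weakest association.
quality_cor <- cor_redwines[, "quality"]
quality_cor <- quality_cor[names(quality_cor) != "quality"]
sort(abs(quality_cor), decreasing = TRUE)
```

The predictors at the bottom of this ordering are the candidates for removal mentioned above.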

Building the Model

We need to convert our response variable to a factor, and then split into training and testing sets.

df$quality <- as.factor(df$quality)

tr <- createDataPartition(df$quality, p = 2/3, list = F)
train_red <- df[tr,]
test_red <- df[-tr,]
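Note that createDataPartition samples within each quality level, so rare classes show up in both sets. A plain, non-stratified alternative in base R (a sketch for comparison only, not what the model above uses) would be:

```r
# Non-stratified 2/3 holdout split using base R's sample().
set.seed(10)
idx       <- sample(seq_len(nrow(df)), size = floor(2/3 * nrow(df)))
train_alt <- df[idx, ]
test_alt  <- df[-idx, ]
```

With quality classes as imbalanced as these, the stratified caret split is the safer choice.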

We are going to approach this using both k-nearest neighbors (KNN) and randomForest. We will use the caret package, which we loaded earlier, to tune each model through the train function, with 5-fold cross-validation repeated 5 times.

Caret

Caret simplifies the tuning of the model. The expand.grid function, which we’ll use below to build the tuneGrid, combines all of the hyperparameter values into every possible combination.

The Preprocessing

KNN uses distance, so we need to make sure all the predictor variables are standardized. We will use the preProcess argument in the train function for this.
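Concretely, centering and scaling standardizes each predictor to mean 0 and standard deviation 1. A sketch of what preProcess does to a single column, using a few made-up alcohol values:

```r
# What preProcess = c("center", "scale") does to one predictor:
x <- c(9.4, 9.8, 9.8, 9.8, 9.4)       # made-up alcohol values
z <- (x - mean(x)) / sd(x)            # standardized: mean 0, sd 1
all.equal(z, as.numeric(scale(x)))    # matches base R's scale()
```

Without this step, predictors on large scales (like total.sulfur.dioxide) would dominate the distance calculation.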

KNN

For KNN, we’ll try 5 kmax values, 2 distance values, and 3 kernels. For the distance, 1 is the Manhattan distance and 2 is the Euclidean distance.
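To make the two distance settings concrete, here is a tiny base-R sketch for two made-up points p and q:

```r
# The two distance settings in the grid, spelled out:
p <- c(1, 2); q <- c(4, 6)
manhattan <- sum(abs(p - q))        # distance = 1: |1-4| + |2-6| = 7
euclidean <- sqrt(sum((p - q)^2))   # distance = 2: sqrt(9 + 16) = 5
```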

train_ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 5)

kknn_grid <- expand.grid(kmax = c(3, 5, 7, 9, 11), distance = c(1, 2),
                         kernel = c("rectangular", "gaussian", "cos"))

kknn_train <- train(quality ~ ., data = train_red, method = "kknn",
                    trControl = train_ctrl, tuneGrid = kknn_grid,
                    preProcess = c("center", "scale"))
plot(kknn_train)

kknn_train$bestTune
##    kmax distance   kernel
## 26   11        1 gaussian

The best tune, after the 5 repetitions, uses kmax = 11 with the Manhattan distance and the Gaussian kernel.

The randomForest model.

For random forest, the only parameter we can tune is mtry, the number of variables randomly sampled as candidates at each split. We’ll try values of 1 through 13, passed through the tuneGrid argument.
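For reference, randomForest's own default mtry for classification is the floor of the square root of the number of predictors, which our 1:13 grid comfortably brackets:

```r
# randomForest's default mtry for classification is floor(sqrt(p)).
p <- 13                        # predictors in the red wine data
default_mtry <- floor(sqrt(p)) # = 3; the 1:13 grid brackets this
```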

rf_grid <- expand.grid(mtry = 1:13)
rf_train <- train(quality ~ ., data = train_red, method = "rf",
                  trcontrol = train_ctrl, tuneGrid = rf_grid, 
                  preProcess = c("center", "scale"))
plot(rf_train)

rf_train$bestTune
##   mtry
## 2    2

An mtry of 2 is the best value to use here.

KNN or randomForest?

Creating the confusion matrix for KNN

kknn_predictor <- predict(kknn_train, test_red)
confusionMatrix(kknn_predictor, test_red$quality)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   3   4   5   6   7   8
##          3   0   0   0   0   0   0
##          4   0   0   0   0   0   0
##          5   0  12 167  68   8   1
##          6   3   5  57 127  33   3
##          7   0   0   3  17  24   2
##          8   0   0   0   0   1   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5989          
##                  95% CI : (0.5558, 0.6408)
##     No Information Rate : 0.4275          
##     P-Value [Acc > NIR] : 1.57e-15        
##                                           
##                   Kappa : 0.3442          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity           0.00000  0.00000   0.7357   0.5991  0.36364 0.000000
## Specificity           1.00000  1.00000   0.7072   0.6834  0.95269 0.998095
## Pos Pred Value            NaN      NaN   0.6523   0.5570  0.52174 0.000000
## Neg Pred Value        0.99435  0.96798   0.7818   0.7195  0.91340 0.988679
## Prevalence            0.00565  0.03202   0.4275   0.3992  0.12429 0.011299
## Detection Rate        0.00000  0.00000   0.3145   0.2392  0.04520 0.000000
## Detection Prevalence  0.00000  0.00000   0.4821   0.4294  0.08663 0.001883
## Balanced Accuracy     0.50000  0.50000   0.7215   0.6412  0.65816 0.499048

The confusion matrix for our randomForest model.

rf_predict <- predict(rf_train, test_red)
confusionMatrix(rf_predict, test_red$quality)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   3   4   5   6   7   8
##          3   0   0   0   0   0   0
##          4   0   0   0   0   0   0
##          5   1  11 185  56   4   0
##          6   2   6  41 143  33   3
##          7   0   0   1  13  29   3
##          8   0   0   0   0   0   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6723          
##                  95% CI : (0.6306, 0.7121)
##     No Information Rate : 0.4275          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4636          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity           0.00000  0.00000   0.8150   0.6745  0.43939   0.0000
## Specificity           1.00000  1.00000   0.7632   0.7335  0.96344   1.0000
## Pos Pred Value            NaN      NaN   0.7198   0.6272  0.63043      NaN
## Neg Pred Value        0.99435  0.96798   0.8467   0.7723  0.92371   0.9887
## Prevalence            0.00565  0.03202   0.4275   0.3992  0.12429   0.0113
## Detection Rate        0.00000  0.00000   0.3484   0.2693  0.05461   0.0000
## Detection Prevalence  0.00000  0.00000   0.4840   0.4294  0.08663   0.0000
## Balanced Accuracy     0.50000  0.50000   0.7891   0.7040  0.70142   0.5000

For the red wine dataset, the random forest model performed best, with an accuracy of about 67% and a strong Kappa of 0.4636, compared to KNN’s accuracy of roughly 60% and Kappa of 0.3442.
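The accuracy and Cohen's Kappa that confusionMatrix reports can be recomputed directly from the raw table. A minimal sketch on a small made-up 2x2 confusion table (not the wine results above):

```r
# Recomputing accuracy and Cohen's kappa from a made-up confusion table.
cm <- matrix(c(50, 10,
                5, 35), nrow = 2, byrow = TRUE)
n        <- sum(cm)
accuracy <- sum(diag(cm)) / n                       # observed agreement
expected <- sum(rowSums(cm) * colSums(cm)) / n^2    # chance agreement
kappa    <- (accuracy - expected) / (1 - expected)
```

Kappa discounts the agreement we would expect by chance, which is why it is a fairer score than raw accuracy on imbalanced classes like these quality levels.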

==================================================================================================================

Next, the white wine data set

df1 <- read.csv("white.csv")
# changing NA's to 0's.
df1[is.na(df1)] <- 0
str(df1)
## 'data.frame':    4898 obs. of  14 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ astringency.rating  : num  0.72 0.66 0.83 0.74 0.74 0.83 0.65 0.72 0.66 0.83 ...
##  $ residual.sugar      : num  0 0 6.9 0 8.5 6.9 7 0 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ vintage             : num  2004 2004 2006 2004 2007 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

Running str(df1) on the white wine dataset shows that there are 4898 samples and the same 14 variables.

We will now visualize the data with a plot of quality against each of the predictor variables.

# Plot quality against each of the 13 predictors, overlaying a
# fitted regression line on each scatterplot.
for (i in 1:13) {
    plot(df1[, i], jitter(df1[, "quality"]), xlab = names(df1)[i],
         ylab = "quality", cex = 0.5, cex.lab = 1)
  
    abline(lm(df1[, "quality"] ~ df1[, i]), lty = 3, lwd = 3)
}

The line on each of these plots displays the linear regression of our response variable quality as a function of each of the predictor variables.

Again, there are a few regression lines which show a very weak association. Like before, we will first split into training and test sets and then we can figure out if we want to keep those features or remove them.

cor_white <- cor(df1)
png(height = 1200, width = 1500, pointsize = 25, file = 'white_cor_plot.png')
corrplot(cor_white, method = 'number')
dev.off()  # close the png device so the file is actually written

Here’s our graph. You can see the weak relationships between quality and citric.acid, residual.sugar, free.sulfur.dioxide, and sulphates, as shown in the plot. After processing the data, we can again say that non-linear classification models will be more appropriate than regression.

Building the Model

We need to convert our response variable to a factor, and then split into training and testing sets.

df1$quality <- as.factor(df1$quality)
tr_white <- createDataPartition(df1$quality, p = 2/3, list = F)
train_white <- df1[tr_white,]
test_white <- df1[-tr_white,]

We are going to approach this using both k-nearest neighbors (KNN) and randomForest. As before, we will use the caret package to tune each model through the train function, with 5-fold cross-validation repeated 5 times.

Caret

Caret simplifies the tuning of the model. The expand.grid function, which we’ll use below to build the tuneGrid, combines all of the hyperparameter values into every possible combination.

The Preprocessing

KNN uses distance, so we need to make sure all the predictor variables are standardized. We will use the preProcess argument in the train function for this.

KNN

For KNN, we’ll try 5 kmax values, 2 distance values, and 3 kernels. For the distance, 1 is the Manhattan distance and 2 is the Euclidean distance.

train_ctrl_white <- trainControl(method = "repeatedcv", number = 5, repeats = 5)

kknn_grid_white <- expand.grid(kmax = c(3, 5, 7, 9, 11), distance = c(1, 2),
                         kernel = c("rectangular", "gaussian", "cos"))

kknn_train_white <- train(quality ~ ., data = train_white, method = "kknn",
                    trControl = train_ctrl_white, tuneGrid = kknn_grid_white,
                    preProcess = c("center", "scale"))
plot(kknn_train_white)

kknn_train_white$bestTune
##    kmax distance kernel
## 15    7        1    cos

The best tune, after the 5 repetitions, uses kmax = 7 with the Manhattan distance and the cosine kernel.

The randomForest model.

For this model, again only the mtry hyperparameter is available to tune. We’ll pass mtry values of 1-13 into the train function’s tuneGrid argument.

rf_grid_white <- expand.grid(mtry = 1:13)
rf_train_white <- train(quality ~ ., data = train_white, method = "rf",
                  trcontrol = train_ctrl_white, tuneGrid = rf_grid_white, 
                  preProcess = c("center", "scale"))
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info =
## trainInfo, : There were missing values in resampled performance measures.
plot(rf_train_white)

rf_train_white$bestTune
##   mtry
## 2    2

An mtry of 2 is the best value to use here.

The Model Selection

We’ll plot the confusion matrix for both of the models to see which model we can use to get some sort of conclusive result from this dataset.

kknn_predict_white <- predict(kknn_train_white, test_white)
confusionMatrix(kknn_predict_white, test_white$quality)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   3   4   5   6   7   8   9
##          3   0   0   0   0   0   0   0
##          4   2   9   6   7   0   0   0
##          5   3  28 286 163  13   6   1
##          6   1  14 173 463 112  21   0
##          7   0   3  19  87 158  15   0
##          8   0   0   1  12  10  16   0
##          9   0   0   0   0   0   0   0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5721          
##                  95% CI : (0.5477, 0.5963)
##     No Information Rate : 0.4494          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3516          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity          0.000000 0.166667   0.5897   0.6325  0.53925 0.275862
## Specificity          1.000000 0.990476   0.8129   0.6421  0.90719 0.985360
## Pos Pred Value            NaN 0.375000   0.5720   0.5906  0.56028 0.410256
## Neg Pred Value       0.996317 0.971963   0.8237   0.6817  0.89978 0.973585
## Prevalence           0.003683 0.033149   0.2977   0.4494  0.17986 0.035605
## Detection Rate       0.000000 0.005525   0.1756   0.2842  0.09699 0.009822
## Detection Prevalence 0.000000 0.014733   0.3069   0.4813  0.17311 0.023941
## Balanced Accuracy    0.500000 0.578571   0.7013   0.6373  0.72322 0.630611
##                       Class: 9
## Sensitivity          0.0000000
## Specificity          1.0000000
## Pos Pred Value             NaN
## Neg Pred Value       0.9993861
## Prevalence           0.0006139
## Detection Rate       0.0000000
## Detection Prevalence 0.0000000
## Balanced Accuracy    0.5000000
rf_predict_white <- predict(rf_train_white, test_white)
confusionMatrix(rf_predict_white, test_white$quality)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   3   4   5   6   7   8   9
##          3   0   0   0   0   0   0   0
##          4   0   6   1   1   0   0   0
##          5   3  33 312 120   5   0   0
##          6   3  15 169 574 139  26   1
##          7   0   0   3  37 149  21   0
##          8   0   0   0   0   0  11   0
##          9   0   0   0   0   0   0   0
## 
## Overall Statistics
##                                         
##                Accuracy : 0.6458        
##                  95% CI : (0.622, 0.669)
##     No Information Rate : 0.4494        
##     P-Value [Acc > NIR] : < 2.2e-16     
##                                         
##                   Kappa : 0.4415        
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity          0.000000 0.111111   0.6433   0.7842  0.50853 0.189655
## Specificity          1.000000 0.998730   0.8593   0.6065  0.95434 1.000000
## Pos Pred Value            NaN 0.750000   0.6596   0.6192  0.70952 1.000000
## Neg Pred Value       0.996317 0.970389   0.8503   0.7749  0.89852 0.970952
## Prevalence           0.003683 0.033149   0.2977   0.4494  0.17986 0.035605
## Detection Rate       0.000000 0.003683   0.1915   0.3524  0.09147 0.006753
## Detection Prevalence 0.000000 0.004911   0.2904   0.5691  0.12891 0.006753
## Balanced Accuracy    0.500000 0.554921   0.7513   0.6953  0.73144 0.594828
##                       Class: 9
## Sensitivity          0.0000000
## Specificity          1.0000000
## Pos Pred Value             NaN
## Neg Pred Value       0.9993861
## Prevalence           0.0006139
## Detection Rate       0.0000000
## Detection Prevalence 0.0000000
## Balanced Accuracy    0.5000000

For white wine, the random forest model again performed better, with an accuracy of about 65% (95% CI of 0.622 to 0.669) and a Kappa of 0.4415. KNN did not perform as well. Both models did a rather poor job of identifying white wines in the two lowest and two highest quality classes.

==================================================================================================================

Finishing up

From our models here, we’ve learned that they can only accurately identify wines of average quality, which limits their usefulness. It is quite difficult to conclude that a model built this way can accurately identify the lowest- and highest-quality wines.